Towards End-to-End Synthetic Speech Detection
Abstract
The constant Q transform (CQT) has been shown to be one of the most effective speech signal pre-transforms for facilitating synthetic speech detection. It is typically followed by either hand-crafted (subband) constant Q cepstral coefficient (CQCC) feature extraction with a back-end binary classifier, or by a deep neural network (DNN) that performs the classification directly. Despite the rich literature on this pipeline, we show in this paper that the pre-transform and hand-crafted features can simply be replaced by end-to-end DNNs operating on the raw waveform. Specifically, we experimentally verify that, using only standard components, a light-weight neural network can outperform state-of-the-art methods on the ASVspoof2019 challenge. The proposed model is termed the Time-domain Synthetic Speech Detection Net (TSSDNet), with ResNet- or Inception-style structures. We further demonstrate that the proposed models have attractive generalization capability: trained on ASVspoof2019, they achieve promising detection performance when tested on the disjoint ASVspoof2015 dataset, significantly better than existing cross-dataset results. This reveals the great potential of end-to-end DNNs for synthetic speech detection without hand-crafted features.
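For a concrete picture of what an end-to-end, time-domain detector looks like, the sketch below shows a minimal 1-D CNN that consumes raw waveforms and outputs bona fide/spoof logits. This is an illustrative assumption only, not the authors' TSSDNet: layer widths, kernel sizes, and the class names are hypothetical choices standing in for the ResNet-/Inception-style blocks described in the paper.

```python
import torch
import torch.nn as nn

class TinyTimeDomainDetector(nn.Module):
    """Minimal raw-waveform CNN for bona fide vs. spoof classification.
    Layer widths are illustrative, not the TSSDNet configuration."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=7, stride=2, padding=3),  # raw waveform in
            nn.BatchNorm1d(16), nn.ReLU(),
            nn.MaxPool1d(4),
            nn.Conv1d(16, 32, kernel_size=3, padding=1),
            nn.BatchNorm1d(32), nn.ReLU(),
            nn.MaxPool1d(4),
            nn.Conv1d(32, 64, kernel_size=3, padding=1),
            nn.BatchNorm1d(64), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),        # pool over time -> fixed-size embedding
        )
        self.classifier = nn.Linear(64, 2)  # logits: [bona fide, spoof]

    def forward(self, waveform):            # waveform: (batch, samples)
        x = waveform.unsqueeze(1)           # -> (batch, 1, samples)
        x = self.features(x).squeeze(-1)    # -> (batch, 64)
        return self.classifier(x)

# Example: score a batch of eight 4-second clips at 16 kHz.
model = TinyTimeDomainDetector()
logits = model(torch.randn(8, 64000))       # shape (8, 2)
```

No spectrogram or cepstral front-end is computed anywhere in this sketch; the convolutions learn their own filters directly from the samples, which is the key contrast with CQT/CQCC pipelines.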
Similar references
Towards End-to-End Speech Recognition
Standard automatic speech recognition (ASR) systems follow a divide and conquer approach to convert speech into text. Alternately, the end goal is achieved by a combination of sub-tasks, namely, feature extraction, acoustic modeling and sequence decoding, which are optimized in an independent manner. More recently, in the machine learning community deep learning approaches have emerged which al...
Tacotron: Towards End-to-End Speech Synthesis
A text-to-speech synthesis system typically consists of multiple stages, such as a text analysis frontend, an acoustic model and an audio synthesis module. Building these components often requires extensive domain expertise and may contain brittle design choices. In this paper, we present Tacotron, an end-to-end generative text-to-speech model that synthesizes speech directly from characters. G...
Towards End-To-End Speech Recognition with Recurrent Neural Networks
This paper presents a speech recognition system that directly transcribes audio data with text, without requiring an intermediate phonetic representation. The system is based on a combination of the deep bidirectional LSTM recurrent neural network architecture and the Connectionist Temporal Classification objective function. A modification to the objective function is introduced that trains the...
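The excerpt above mentions the Connectionist Temporal Classification (CTC) objective. As a generic illustration of how such an alignment-free loss is computed over frame-level log-probabilities (a sketch under assumed shapes and a hypothetical 28-symbol alphabet, not that paper's exact setup):

```python
import torch
import torch.nn as nn

# Assumed setup: 50 time steps, batch of 2, 28-symbol alphabet (blank index 0).
log_probs = torch.randn(50, 2, 28, requires_grad=True).log_softmax(dim=-1)  # (T, N, C)
targets = torch.randint(1, 28, (2, 10))            # label sequences, no blanks
input_lengths = torch.full((2,), 50, dtype=torch.long)
target_lengths = torch.full((2,), 10, dtype=torch.long)

ctc = nn.CTCLoss(blank=0)                          # sums over all valid alignments
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()                                    # gradients flow to the acoustic model
```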
Towards Language-Universal End-to-End Speech Recognition
Building speech recognizers in multiple languages typically involves replicating a monolingual training recipe for each language, or utilizing a multi-task learning approach where models for different languages have separate output labels but share some internal parameters. In this work, we exploit recent progress in end-to-end speech recognition to create a single multilingual speech recogniti...
Towards End-to-End Lane Detection: an Instance Segmentation Approach
Modern cars are incorporating an increasing number of driver assist features, among which automatic lane keeping. The latter allows the car to properly position itself within the road lanes, which is also crucial for any subsequent lane departure or trajectory planning decision in fully autonomous cars. Traditional lane detection methods rely on a combination of highly-specialized, hand-crafted...
Journal
Journal title: IEEE Signal Processing Letters
Year: 2021
ISSN: 1070-9908, 1558-2361
DOI: https://doi.org/10.1109/lsp.2021.3089437